Inspect data

Wordclouds

Before pre-processing:

Pre-processed Data:

TF-IDF

The statistic tf-idf (term frequency - inverse document frequency) is intended to measure how important a word is to a document in a collection (or corpus) of documents.

The inverse document frequency for any given term is defined as

\[ idf\text{(term)}=\ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \]
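As a quick illustration (using the tidytext package, which is not part of the pipeline below), `bind_tf_idf()` computes tf, idf, and tf-idf from per-document word counts:

```r
library(dplyr)
library(tidytext)

# Toy corpus: word counts per document
word_counts <- tibble::tribble(
  ~document, ~term,      ~n,
  "doc1",    "energie",   3,
  "doc1",    "partei",    2,
  "doc2",    "partei",    4,
  "doc2",    "steuer",    1
)

# "partei" appears in both documents, so idf = ln(2/2) = 0
# and its tf-idf is 0; rarer terms are weighted up
word_counts %>% bind_tf_idf(term, document, n)
```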

Bigrams

Build Corpus

Structural Topic Model

Parties want the media agenda to be congruent with their own agenda in order to define the issue-based criteria on which voters will evaluate them (Eberl, 2017). Thus, parties instrumentalize their press releases to highlight issues on which they are perceived to be competent, which they “own”, and which are important to their voters (Kepplinger & Maurer, 2004). Editors can select from this universe and decide which of these topics will be discussed in the news. In that sense, the ideological content of a newspaper refers to the extent to which the topics promoted by the parties correlate with the topics discussed in the news articles.

To discover the latent topics in the corpus of press releases (1,942) and news articles (11,880), a structural topic model (STM) as developed by Roberts et al. (2016) is applied. The STM is an unsupervised machine learning approach that models topics as multinomial distributions of words and documents as multinomial distributions of topics, and allows external covariates that affect both topical content and topical prevalence to be incorporated.

Select Model

STM assumes a fixed, user-specified number of topics. There is no “right” answer to the number of topics appropriate for a given corpus (Grimmer and Stewart 2013). Roberts et al. (2016) propose measuring topic quality through a combination of the semantic coherence and the exclusivity of words to topics. Semantic coherence is a criterion developed by Mimno et al. (2011) and is closely related to pointwise mutual information (Newman et al. 2010): it is maximized when the most probable words in a given topic frequently co-occur together.

Using the function searchK, several automated diagnostics are computed to help choose the number of topics, including the average exclusivity and semantic coherence as well as the held-out likelihood (Wallach et al. 2009) and the residuals (Taddy 2012).

Run Model

I included the document source as a covariate for topical prevalence, assuming that the distribution of topics depends on the source. The number of topics is set to 50.

Results

library(stm)
library(tidyverse)
library(ggthemes)
library(xtable)
library(viridis)

rm(list = ls())
color <- "#b7b7b7"
color1 <- "#778899"
color2 <- "#808080"
color3 <- "#000000"
source("func/functions.R")

load("../output/models/finalmodel_50.RDa")
k <- stmOut$settings$dim$K
  
model_df <- model_df %>%
  dplyr::mutate(
    doc_index = as.numeric(rownames(.)),
    # map raw source identifiers to display names
    source = dplyr::recode(source,
      "welt.de"    = "DIE WELT",
      "zeit.de"    = "ZEIT ONLINE",
      "focus.de"   = "FOCUS Online",
      "bild.de"    = "Bild.de",
      "spiegel.de" = "SPIEGEL ONLINE",
      "union"      = "Union",
      "spd"        = "SPD",
      "afd"        = "AfD",
      "gruene"     = "Grüne",
      "linke"      = "Linke",
      "fdp"        = "FDP"
    )
  )

stmOut$settings$call
## stm(documents = out$documents, vocab = out$vocab, K = k, prevalence = ~source, 
##     content = ~type, data = out$meta, init.type = "Spectral", 
##     max.em.its = 75)
model_df %>%
  group_by(type) %>%
  tally()
## # A tibble: 2 x 2
##   type      n
##   <chr> <int>
## 1 news  15135
## 2 press  2666

Label topics

To explore the words associated with each topic, we use the words with the highest probability in each topic. As we included the source type (press release or news article) as a covariate for topical content (the word distribution of each topic), we obtain two different labels for each topic.

sagelabs <- sageLabels(stmOut, 20)
for (i in seq(k)) {
  
  name <- paste0("topic_label",i)

  png(paste0('../figs/topiclabel/',name,'.png'), width = 400, height = 450)

  plotQuote(
    
    c(paste(sagelabs$covnames[1],":",
            paste(sagelabs$cov.betas[[1]]$problabels[i,], collapse="\n")),
      paste(sagelabs$covnames[2],":",
            paste(sagelabs$cov.betas[[2]]$problabels[i,], collapse="\n"))
      )
    )

  dev.off()

}
theta <- as.data.frame(stmOut$theta) %>% # get all theta values for each document
  
  mutate(doc_index = as.numeric(rownames(.))) %>%
  # convert to long format
  gather(topic, theta, -doc_index) %>%
  mutate(topic = as.numeric(gsub("V","",topic))) %>%
  
  # join with topic df
  left_join(., topics.df, by="topic") %>%
  
  # join with model_df
  left_join(., model_df %>% 
              select(date,type,source,doc_index,title_text), by="doc_index")
# select a random document
doc <- sample(unique(theta$doc_index),1)

sample <- theta %>% filter(doc_index == doc) 
caption <- model_df %>% filter(doc_index == doc) %>% select(title, source)

sample %>%
  ggplot(aes(joint_label, theta)) +
  geom_col(fill = color1) +
  coord_flip() +
  ylim(c(0,1)) +
  theme_hc() +
  labs(x = NULL, y = NULL, caption = paste("title:",caption$title,"(",caption$source,")"))

ggsave("../figs/doc_topic_distr.png", height = 8, width = 8)

Topic frequency

The expected proportion of the corpus that belongs to each topic is used to get an initial overview of the results. The figure below displays the topics ordered by their expected frequency across the corpus. The four most frequent words in each topic are used as a label for that topic.

overall_freq <- as.data.frame(colMeans(stmOut$theta)) %>%
  transmute(
    topic = as.numeric(rownames(.)),
    frequency = colMeans(stmOut$theta)
         ) %>%
  left_join(., topics.df, by = "topic") %>% 
  arrange(desc(frequency))%>%
  mutate(order = row_number())
overall_freq %>%
  ggplot(aes(reorder(joint_label, -order),frequency)) +
  geom_col(fill = color1) +
  coord_flip() +
  scale_fill_gradient(low = color2, high = color1) +
  theme_hc() +
  labs(x=NULL, y=NULL) 

ggsave("../figs/topic_proportion.png", height = 10, width = 10)

Measure Agendas

Agendas were measured in terms of percentage distributions across the 50 topics. For each source the average distribution of each topic is calculated for each month.

topicmean_monthly <- theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  filter(month != 5) %>%
  group_by(topic, joint_label, source,type, month, year) %>%
  summarise(topicmean = mean(theta, na.rm = T)) %>% 
  ungroup() %>%
  mutate(date = as.Date(paste0(year,"/",month,"/1"))) 
topicmean_news <- theta %>%
  filter(type == "news") %>%
  group_by(topic,joint_label,source) %>%
  summarise(topicmean = mean(theta, na.rm = T)) %>% 
  ungroup()

topicmean_press <- theta %>%
  filter(type == "press") %>%
  group_by(topic,joint_label, source) %>%
  summarise(topicmean = mean(theta)) %>% 
  ungroup()
topicmean_news %>%
  group_by(source) %>%
  arrange(desc(topicmean), .by_group = TRUE) %>%
  mutate(topic_order = row_number()) %>%
  ungroup() %>%
  
  group_by(joint_label) %>%
  mutate(topicmean_mean = mean(topicmean)) %>%
  ungroup() %>%
  top_n(70, topicmean_mean) %>%
  
  ggplot(aes(reorder(joint_label, topicmean_mean),
             topicmean, label = topic_order,
             fill = topic_order)) +
  geom_col(show.legend = F) +
  geom_text(hjust=-0.1, size=5) +
  coord_flip() +
  scale_fill_gradient(low = color1, high = color3) +
  facet_wrap(~source, nrow = 1) +
  labs(x=NULL, y=NULL) +
  theme(axis.text.y = element_text(size=12))

topicmean_press %>%
  group_by(source) %>%
  arrange(desc(topicmean), .by_group = TRUE) %>%
  mutate(topic_order = row_number()) %>%
  ungroup() %>%
  
  group_by(joint_label) %>%
  mutate(topicmean_mean = mean(topicmean)) %>%
  ungroup() %>%
  top_n(50, topicmean_mean) %>%
  
  ggplot(aes(reorder(joint_label, topicmean_mean),
             topicmean, label = topic_order,
             fill=topic_order)) +
  geom_col(show.legend = F) +
  geom_text(hjust=-0.1, size=5) +
  coord_flip() +
  scale_fill_gradient(low = color1, high = color3) +
  facet_wrap(~source, nrow = 1) +
  labs(x=NULL, y=NULL) +
  theme(axis.text.y = element_text(size=12))

Correlation of topic prevalence

For each source \(s\), we collect the document-topic distributions into a matrix \(\Theta_s\) whose columns are the document vectors \(\theta_j\):

\[ \Theta_s = \begin{bmatrix} \theta_{11} & \dots & \theta_{1d} \\ \vdots & \ddots & \vdots \\ \theta_{k1} & \dots & \theta_{kd} \end{bmatrix} \]

where \(\theta_{j}\) is the \(j\)-th column of \(\Theta_s\) for \(j \in \{1, \dots, d\}\). The \(k \times 1\) vector \(\theta_j\) gives the \(j\)-th document’s probability distribution over the \(k\) topics.
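A toy version of \(\Theta_s\) (illustrative values, not model output) shows the column convention: each column is one document's distribution over the topics and therefore sums to 1.

```r
# Toy document-topic matrix: k = 3 topics (rows), d = 4 documents (columns)
Theta <- matrix(c(0.70, 0.20, 0.10,
                  0.10, 0.80, 0.10,
                  0.30, 0.30, 0.40,
                  0.50, 0.25, 0.25),
                nrow = 3)

# Each column is a probability distribution over the k topics
colSums(Theta)  # all equal to 1
```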

E.g. for “DIE WELT” the following matrix of document-topic distributions is given (rows are topics, columns are document indices):

theta %>% 
  filter(source == "DIE WELT") %>%
  filter(doc_index %in% seq(1,50)) %>%
  select(doc_index, topic, theta) %>%
  mutate(theta = round(theta, 3)) %>%
  spread(doc_index,theta) %>% select(-topic) %>% 
  htmlTable::htmlTable()
1 2 3 4 5 12 14 15 38 40 42
1 0.031 0.007 0.002 0.028 0.035 0.074 0.001 0.036 0 0 0.001
2 0.013 0.001 0.046 0.015 0.02 0.156 0 0.035 0.001 0.001 0.001
3 0.019 0 0.001 0.022 0.014 0.008 0 0.008 0 0 0.001
4 0.001 0 0.001 0.001 0.001 0.002 0 0.001 0 0 0.008
5 0.001 0.001 0 0.001 0.001 0.001 0 0.002 0.001 0.002 0
6 0.006 0 0 0.007 0 0.04 0 0.001 0 0 0
7 0.008 0.002 0.001 0.009 0.009 0.003 0 0.013 0 0 0.001
8 0.029 0 0.005 0.022 0.022 0.006 0 0.013 0 0 0.001
9 0.049 0.025 0.002 0.049 0.06 0.138 0.024 0.073 0 0 0.001
10 0.011 0.109 0.014 0.008 0.009 0.034 0.124 0.014 0.001 0.001 0
11 0.021 0 0.002 0.011 0.009 0.004 0 0.009 0 0 0.004
12 0 0 0 0 0 0.013 0 0 0 0 0
13 0.006 0.003 0 0.006 0.011 0.001 0 0.006 0 0 0
14 0.294 0 0 0.341 0.395 0.034 0 0.55 0 0 0.008
15 0 0 0.008 0 0 0.006 0 0 0 0 0
16 0.014 0.008 0.001 0.013 0.005 0.003 0 0.01 0 0 0
17 0.001 0.001 0.002 0.011 0 0.027 0.001 0 0 0 0.03
18 0 0.001 0.003 0.001 0.001 0.001 0 0.001 0 0 0.011
19 0.005 0.001 0 0.004 0.005 0.001 0 0.006 0 0 0.001
20 0 0.001 0 0.001 0.001 0.004 0 0 0 0 0
21 0.003 0.002 0 0.002 0.002 0.003 0.006 0.005 0 0 0.001
22 0.001 0 0.769 0 0 0.001 0 0 0.985 0.983 0.001
23 0.03 0 0.005 0.037 0.037 0.001 0 0.024 0 0 0.003
24 0 0.007 0.001 0 0 0 0.007 0 0.002 0.002 0
25 0 0 0 0 0 0.002 0 0 0 0 0
26 0.101 0 0.005 0.065 0.047 0.006 0 0.022 0 0 0.015
27 0.009 0.749 0 0.011 0.016 0.001 0.735 0.02 0 0 0
28 0.231 0 0 0.227 0.219 0 0 0.054 0 0 0
29 0 0.007 0 0 0 0 0.005 0 0 0 0
30 0 0 0 0 0 0.001 0 0 0 0 0.321
31 0 0 0 0 0 0 0 0 0.004 0.004 0
32 0 0 0 0 0 0.083 0 0 0 0 0.002
33 0.007 0.001 0.082 0.005 0.002 0.095 0.001 0.002 0.001 0.001 0.131
34 0.057 0.001 0.011 0.04 0.015 0 0 0.013 0 0 0.413
35 0 0 0 0 0 0.017 0 0 0.001 0.001 0
36 0 0 0 0 0 0 0 0 0 0 0
37 0.005 0.001 0 0.006 0.008 0.005 0.003 0.013 0 0 0.003
38 0.007 0 0 0.009 0.011 0.038 0 0.011 0 0 0
39 0.001 0.036 0 0.001 0.002 0.046 0.048 0 0 0 0
40 0 0 0.001 0 0.001 0.003 0 0 0 0 0
41 0.01 0 0.016 0.01 0.007 0.001 0 0.005 0 0 0.001
42 0.002 0.031 0.009 0.004 0.001 0.004 0.032 0.002 0 0 0.002
43 0.001 0.002 0.006 0 0 0.072 0.006 0 0 0 0.009
44 0 0 0 0 0 0.024 0 0 0 0 0.001
45 0 0 0 0 0 0.008 0 0 0 0 0.023
46 0 0 0.001 0 0 0.012 0 0 0.001 0.001 0
47 0 0 0 0 0.001 0.001 0 0.001 0 0 0
48 0.023 0 0 0.029 0.032 0.001 0 0.049 0 0 0
49 0 0 0 0 0 0.017 0 0 0 0 0.003
50 0 0 0 0 0 0.002 0 0 0 0 0

Grouping by source and topic, the mean prevalence of each topic is given by

\[ \bar{\theta}_{i} = \frac{1}{d}\sum^d_{j=1}\theta_{ij} \]

where \(i \in \{1, \dots, k\}\).
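On a toy matrix (illustrative values only), this per-topic mean is simply the row mean of \(\Theta_s\):

```r
# Toy document-topic matrix: k = 2 topics (rows), d = 3 documents (columns)
Theta <- matrix(c(0.6, 0.4,
                  0.2, 0.8,
                  0.4, 0.6),
                nrow = 2)

# Mean prevalence of each topic across the d documents
theta_bar <- rowMeans(Theta)
theta_bar  # topic 1: (0.6 + 0.2 + 0.4) / 3 = 0.4; topic 2: 0.6
```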

# calculate topic mean by source and month
topicmean <- theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  filter(month != 5) %>%
  group_by(topic,source) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  spread(source, topicmean) 

theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  filter(month != 5) %>%
  group_by(topic,source) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  mutate(topicmean = round(topicmean,4)) %>%
  spread(source, topicmean) %>%
  select(-topic) %>%
  htmlTable::htmlTable()
AfD B90/GRÜNE Bild.de CDU DIE LINKE DIE WELT FDP FOCUS Online SPD SPIEGEL ONLINE stern.de tagesschau.de ZEIT ONLINE
1 0.0176 7e-04 0.0508 0.007 0.0065 0.0376 0.0077 0.049 0.0177 0.0486 0.0385 0.0378 0.0545
2 0.0146 0.0063 0.038 0.0014 0.004 0.0212 0.0088 0.0285 0.0019 0.0362 0.0165 0.0184 0.0299
3 0.0111 0.0398 0.0243 0.0016 0.0081 0.025 0.0146 0.0231 0.0043 0.027 0.0326 0.0166 0.0339
4 0.0077 0.0062 0.0643 0.0067 0.0178 0.0504 0.0301 0.055 0.0095 0.0551 0.0362 0.0509 0.0573
5 0.0059 0.0021 0.0125 0.0048 0.0027 0.0011 0.006 8e-04 0.005 0.0023 0.0023 0.0012 0.0026
6 0.0209 0.0579 0.0148 0.0505 0.0388 0.0073 0.0499 0.0057 0.0288 0.0116 0.0118 0.034 0.0158
7 0.0169 3e-04 0.0167 0.0073 0.006 0.0161 0.0103 0.0106 0.0093 0.0102 0.0114 0.0148 0.0092
8 0.0079 0.0017 0.0262 4e-04 0.0143 0.0178 0.0023 0.0205 0.0052 0.024 0.0174 0.0204 0.0213
9 0.0185 0.0024 0.0301 0.0043 0.0061 0.0284 0.0113 0.0279 0.01 0.0279 0.0326 0.0179 0.0281
10 0.0151 0.0211 0.0186 0.0422 0.0307 0.0143 0.0142 0.0102 0.0065 0.0355 0.0125 0.0097 0.0111
11 0.005 0.014 0.0311 0.0241 0.0014 0.0248 0.0288 0.0328 0.0199 0.0301 0.0225 0.0258 0.0352
12 0.0151 0.0235 0.0059 0.0063 0.0281 0.0182 0.0013 0.0203 0.0086 0.0088 0.019 0.0124 0.0094
13 0.0141 0.0033 0.0064 0.0056 0.0098 0.0164 0.0048 0.0131 0.0041 0.0065 0.0062 0.0037 0.0084
14 0.0187 0.0074 0.0431 0.0153 0.0125 0.0552 0.0119 0.0382 0.0101 0.0374 0.0507 0.0292 0.0358
15 0.011 0.0044 0.0145 0.0089 0.0061 0.0072 0.0131 0.0049 0.0194 0.0047 0.0024 0.0068 0.0048
16 0.0062 0.0036 0.01 0.0121 0.0083 0.0097 0.0034 0.0101 0.0176 0.0087 0.0127 0.0085 0.0122
17 0.0472 0.0195 0.0271 0.0305 0.0514 0.0181 0.0888 0.0164 0.0353 0.0204 0.0352 0.042 0.028
18 0.0014 0.0039 0.0092 0.0117 0.0024 0.013 0.0019 0.0088 0.0095 0.0086 0.0124 0.0117 0.0096
19 0.1081 0.0019 0.0267 2e-04 7e-04 0.0265 5e-04 0.0197 0.002 0.0202 0.025 0.0161 0.0288
20 0.0455 0.0193 0.0119 0.0114 0.0342 0.0515 0.0303 0.0589 0.0115 0.0246 0.0261 0.0444 0.0117
21 0.012 0.0096 0.0104 0.0606 0.0141 0.0082 0.0173 0.006 0.0479 0.0099 0.0116 0.0121 0.0086
22 0.0071 0.0144 0.013 0.0379 0.01 0.0102 0.019 0.0171 0.0274 0.0179 0.0206 0.0079 0.0124
23 0.0043 4e-04 0.051 0.0028 5e-04 0.0506 0.2278 0.0377 5e-04 0.0461 0.0422 0.0412 0.0538
24 0.0453 0.0299 0.0161 0.0327 0.097 0.0146 0.0193 0.0197 0.0113 0.0315 0.0231 0.0219 0.0145
25 0.0101 0.0257 0.0183 0.0434 0.0151 0.0183 0.0058 0.0233 0.0507 0.0096 0.01 0.0192 0.0076
26 0.0108 0.0016 0.0264 0.0014 0.0038 0.0275 0.0099 0.0255 0.0057 0.028 0.0281 0.0261 0.0345
27 0.0201 0.0234 0.0093 0.0223 0.0436 0.0215 0.0152 0.031 0.0085 0.0419 0.0674 0.0151 0.0194
28 0.0465 0.0046 0.0237 2e-04 6e-04 0.0191 6e-04 0.0203 0.0012 0.0268 0.0268 0.0126 0.0352
29 0.0135 0.0187 0.0147 0.003 0.0246 0.0207 0.0082 0.0217 0.002 0.0171 0.0202 0.0269 0.0229
30 0.0123 0.0205 0.0145 0.0032 0.0413 0.0066 0.0151 0.0075 0.0234 0.0103 0.0076 0.03 0.0105
31 0.0133 0.0028 0.0072 0.0041 0.0021 0.0034 0.0033 0.0064 0.0047 0.0059 0.006 0.0068 0.0029
32 0.0138 0.0109 0.0223 0.0299 0.0233 0.0404 0.0117 0.0308 0.0341 0.0161 0.0296 0.0365 0.0217
33 0.0085 0.027 0.0407 0.0592 0.0125 0.0441 0.0104 0.0399 0.0445 0.0378 0.0192 0.0315 0.0546
34 0.0327 0.023 0.0246 0.0126 0.0125 0.0243 0.005 0.0261 0.0114 0.0323 0.023 0.0309 0.0346
35 0.0126 4e-04 0.0137 0.0158 0.001 0.0112 0.0014 0.0094 0.0011 0.0039 0.0034 0.0065 0.0036
36 0.0114 0.0269 0.0105 0.0405 0.0291 0.0255 0.0169 0.0476 0.054 0.0284 0.0299 0.0525 0.0096
37 0.005 0.0051 0.0094 0.0024 0.0105 0.0095 0.0038 0.0119 0.0071 0.0155 0.0127 0.0096 0.0164
38 0.0125 0.0018 0.033 0.0017 0.0014 0.0056 0.0062 0.0105 0.0028 0.0045 0.004 0.0041 0.0042
39 0.0544 0.0839 0.0118 0.0308 0.0369 0.0207 0.064 0.0181 0.0305 0.0276 0.0263 0.0177 0.0186
40 0.0571 0.0056 0.0083 0.0019 0.0032 0.0147 0.0011 0.0135 3e-04 0.0109 0.012 0.0091 0.0156
41 0.0041 0.004 0.0151 0.0366 0.0052 0.0149 0.0032 0.0149 0.0051 0.0156 0.0127 0.0078 0.0182
42 0.0185 0.0599 0.0116 0.0144 0.0198 0.0082 0.0213 0.0098 0.0279 0.0121 0.0167 0.0171 0.0181
43 0.0343 0.1337 0.0183 0.1098 0.061 0.0263 0.0443 0.0128 0.1003 0.0167 0.0188 0.0339 0.022
44 0.0648 0.001 0.014 0.0191 0.0072 0.0259 0.0165 0.0182 0.0032 0.0142 0.0123 0.0186 0.0253
45 0.0067 0.0025 0.0048 0.0133 0.0149 0.0128 0.0177 0.0141 0.0151 0.0217 0.0235 0.0111 0.0234
46 0.0215 0.0319 0.0279 0.0299 0.0075 0.0216 0.0033 0.0224 0.0223 0.0362 0.0073 0.0167 0.0079
47 0.0161 0.0096 0.0049 0.0021 0.0127 0.0023 4e-04 0.0039 0.0172 0.0029 0.0184 0.0038 0.0045
48 0.0017 7e-04 0.0126 1e-04 4e-04 0.0021 0.0016 0.0043 5e-04 6e-04 8e-04 5e-04 0.0036
49 0.0155 0.1754 0.0276 0.1149 0.1922 0.0264 0.0868 0.0161 0.2006 0.0073 0.0359 0.0472 0.0276
50 0.0051 0.0059 0.0019 0.0012 0.0058 0.0035 0.0025 0.0046 0.0027 0.0021 0.0058 0.0031 0.001

Next, I estimate bivariate correlations between the party agendas (press releases) and the mediated party agendas in the online news. These correlations capture the agenda selectivity each party experiences in each media outlet: the higher the correlation, the more congruent the two agendas.
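On toy data (hypothetical topic-prevalence vectors, not from the model), this is simply Pearson's correlation between the two agenda vectors:

```r
# Hypothetical average topic prevalences over 5 topics
party_agenda <- c(0.40, 0.25, 0.15, 0.10, 0.10)  # press releases
media_agenda <- c(0.35, 0.30, 0.10, 0.15, 0.10)  # news articles

# A high correlation indicates congruent agendas
cor(party_agenda, media_agenda)
```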

E.g. for “CDU” we get the following correlations:

media <- unique(model_df %>% filter(type == "news") %>% select(source))
parties <- unique(model_df %>% filter(type == "press") %>% select(source))

cor(topicmean[,media$source], topicmean[,"CDU"]) %>% htmlTable::htmlTable()
CDU
DIE WELT 0.111393975907849
stern.de 0.104683612695906
ZEIT ONLINE 0.035627367692623
FOCUS Online -0.009682026527168
Bild.de 0.00426245147510798
SPIEGEL ONLINE -0.00539349393457622
tagesschau.de 0.373370887419879
# calculate topic mean by source and month
topicmean <- theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  group_by(topic,source, month, year) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  spread(source, topicmean) %>%
  filter(month != 5)
media <- unique(model_df %>% filter(type == "news") %>% select(source))
parties <- unique(model_df %>% filter(type == "press") %>% select(source))
  
if (exists("corrDF")) rm(corrDF)  # start from a clean slate
for (i in parties$source) {
  
  tempdf <- topicmean %>%
    group_by(month, year) %>%
    do(data.frame(Cor=t(cor(.[,media$source], .[,i])))) %>%
    gather(medium, cor, 3:9) %>%
    mutate(party = i,
           medium = gsub("Cor.","",medium)) %>%
    ungroup()
  
  if (exists("corrDF")){
    corrDF <- rbind(corrDF,tempdf)
  } else {
    corrDF <- tempdf
  }
  
}

agenda <- corrDF %>% 
  mutate(date = as.Date(paste0(year,"/",month,"/1")),
         cor_norm = normalize_data2(cor)
         ) %>%
  dplyr::mutate(medium = ifelse(medium == "DIE.WELT", "DIE WELT", medium),
                medium = ifelse(medium ==  "ZEIT.ONLINE", "ZEIT ONLINE", medium),
                medium = ifelse(medium == "FOCUS.Online", "FOCUS Online", medium),
                medium = ifelse(medium == "SPIEGEL.ONLINE", "SPIEGEL ONLINE", medium)
  ) %>%
  filter(month != 5)
agenda %>%
  ggplot(aes(date, cor, 
             color = medium,
             linetype = medium)) +
  geom_line() +
  geom_hline(yintercept = 0, size = 0.3, color = color1) +
  facet_wrap(~party) +
  theme_hc() +
  scale_color_viridis_d() +
  labs(y=NULL, x =NULL) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b-%y") +
  theme(legend.position = "bottom",
        legend.title = element_blank()) +
     guides(col = guide_legend(nrow = 1))

#plotly::ggplotly(p, tooltip=c("cor","medium"))

ggsave("../figs/corr_timeline.png", height = 6, width = 10)
# normalized between -1 and 1
agenda_norm <- agenda %>%
  select(medium, cor, party, date) %>%
  spread(key = party, value = cor) %>%
  mutate(
    AfD = normalize_data2(AfD),
    `B90/GRÜNE` = normalize_data2(`B90/GRÜNE`),
    CDU = normalize_data2(CDU),
    `DIE LINKE` = normalize_data2(`DIE LINKE`),
    FDP = normalize_data2(FDP),
    SPD = normalize_data2(SPD)
  ) %>%
  gather(key = party, value = cor, AfD:SPD) 
ggplot(agenda_norm, 
       aes(date, cor, 
           color = medium,
           linetype = medium)) +
  geom_line() +
  geom_hline(yintercept = 0, size = 0.3, color = color1) +
  facet_wrap(~party) +
  theme_hc() +
  scale_color_viridis_d() +
  labs(y=NULL, x =NULL) +
  theme(legend.position = "bottom",
        legend.title = element_blank()) +
   guides(col = guide_legend(nrow = 1))

#plotly::ggplotly(p, tooltip=c("cor","medium"))

ggsave("../figs/corr_diff_timeline.png", height = 6, width = 10)
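`normalize_data2()` is sourced from func/functions.R and not shown here; a plausible stand-in (an assumption — the actual implementation may differ) is a linear min-max rescaling to \([-1, 1]\):

```r
# Hypothetical stand-in for normalize_data2(): rescale a numeric vector
# linearly so that its minimum maps to -1 and its maximum to 1
normalize_data2 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}

normalize_data2(c(0, 5, 10))  # -1  0  1
```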

Radar chart

radar <- agenda %>%
  group_by(party, medium) %>%
  summarize(cor = mean(cor, na.rm = T)) %>%
  spread(key = party, value = cor) 

ggiraphExtra::ggRadar(radar, aes(color = medium),
                      rescale = F,
                      alpha = 0) + 
  theme_hc() +
  scale_color_viridis_d() +
  theme(legend.position = "right",
        #axis.text.y = element_blank(),
        #axis.ticks.y = element_blank(),
        legend.title = element_blank()) +
   guides(col = guide_legend(ncol = 1))

ggsave("../figs/radar1.png", width = 16, height = 9, dpi = 120)

Deviation from average visibility

Due to differences in political relevance, not all potential topics receive equal amounts of coverage in the media. However, such factors should influence all media outlets equally. To what extent does the topic correlation of a party in one medium differ from the average topic correlation across the media? To calculate this relative topic correlation, I estimate the deviation of a party's topic correlation in one medium from the average topic correlation of that party over all newspapers.
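The deviation measure can be sketched on toy data (the values and the party/medium names are illustrative, not taken from the pipeline above):

```r
library(dplyr)

# Toy correlations: one value per (party, medium) pair
toy <- tibble::tibble(
  party  = c("A", "A", "B", "B"),
  medium = c("m1", "m2", "m1", "m2"),
  cor    = c(0.6, 0.2, 0.5, 0.5)
)

# Deviation of each medium's correlation from the party's average:
# positive values mean above-average congruence for that party
toy %>%
  group_by(party) %>%
  mutate(cor_dev = cor - mean(cor)) %>%
  ungroup()
```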

ggiraphExtra::ggRadar(radar, aes(color = medium),
                      alpha = 0) + 
  theme_hc() +
  scale_color_viridis_d() +
  theme(legend.position = "none",
        legend.title = element_blank()) +
   guides(col = guide_legend(ncol = 1))

ggsave("../figs/radar_diff.png", width = 16, height = 9, dpi = 120)